All-Word Prediction As The Ultimate Confusible Disambiguation

Author

  • Antal van den Bosch
Abstract

We present a classification-based word prediction model based on IGTREE, a decision-tree induction algorithm with favorable scaling abilities and a functional equivalence to n-gram models with backoff smoothing. Through a first series of experiments, in which we train on Reuters newswire text and test either on the same type of data or on general or fictional text, we demonstrate that the system exhibits log-linear increases in prediction accuracy with increasing numbers of training examples. Trained on 30 million words of newswire text, prediction accuracies range between 12.6% on fictional text and 42.2% on newswire text. In a second series of experiments we compare all-words prediction with confusable prediction, i.e., the same task, but specialized to predicting among limited sets of words. Confusable prediction yields high accuracies on nine example confusable sets in all genres of text. The confusable approach outperforms the all-words-prediction approach, but with more data the difference decreases.
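
The abstract notes that the classifier is functionally equivalent to an n-gram model with backoff. As a rough illustration of that idea only (this is not IGTREE and not the author's implementation), the sketch below counts which word follows each left context and predicts from the longest context seen in training, backing off to shorter ones. The names `train`, `predict`, and `max_context` are hypothetical, chosen for this example.

```python
from collections import Counter, defaultdict


def train(tokens, max_context=2):
    """Count which word follows each left context of length 0..max_context.

    Assumption: a plain count table stands in for the induced decision tree.
    """
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        for n in range(max_context + 1):
            if i - n < 0:
                break
            context = tuple(tokens[i - n:i])  # the n words preceding position i
            counts[context][word] += 1
    return counts


def predict(counts, history, max_context=2):
    """Return the most frequent next word for the longest known context.

    Backs off from the longest usable context down to the empty context,
    which amounts to predicting the overall most frequent word.
    """
    for n in range(min(max_context, len(history)), -1, -1):
        context = tuple(history[len(history) - n:])
        if context in counts:
            return counts[context].most_common(1)[0][0]
    return None  # nothing was seen in training at all


tokens = "the cat sat on the mat the cat ran".split()
model = train(tokens)
print(predict(model, ["on", "the"]))   # full two-word context was seen
print(predict(model, ["saw", "the"]))  # backs off to the one-word context
print(predict(model, ["zebra"]))       # backs off to the empty context
```

Prediction accuracy in the sense used above would then be the fraction of test positions at which the predicted word equals the actual next word.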

Similar resources

Token merging in language model-based confusible disambiguation

In the context of confusible disambiguation (spelling correction that requires context), the synchronous back-off strategy combined with traditional n-gram language models performs well. However, when alternatives consist of a different number of tokens, this classification technique cannot be applied directly, because the computation of the probabilities is skewed. Previous work already showed...

Scalable classification-based word prediction and confusible correction

ABSTRACT. We present a classification-based word prediction model based on IGTREE, a decision-tree induction algorithm with favorable scaling abilities. Through a first series of experiments we demonstrate that the system exhibits log-linear increases in prediction accuracy and decreases in discrete perplexity, a new evaluation metric, with increasing numbers of training examples. The induced t...

Word Sense Disambiguation of Ambiguous Persian Words Using the LDA Topic Model

Word sense disambiguation is the task of identifying the correct sense of a word in a given context from a finite set of possible senses. In this paper a model for Farsi word sense disambiguation is presented. The model uses two groups of features: first, all words and stop words around the target word, and second, topic models. We extract topics from a Farsi corpus with Latent Dirichlet ...

Mining Rules for Word Sense Disambiguation

This paper describes the automatic generation and the evaluation of sets of rules for word sense disambiguation (WSD) in machine translation. The ultimate aim is to identify high-quality rules that can be used as knowledge sources in a relational WSD model. The evaluation was carried out both automatically, by means of four objective measures (error, coverage, support and novelty), and manually...

Korean Word-Sense Disambiguation Using Parallel Corpus as Additional Resource

Most previous research on Korean Word-Sense Disambiguation (WSD) has focused on unsupervised corpus-based or knowledge-based approaches because of the lack of sense-tagged Korean corpora. Recently, along with great effort of constructing sense-tagged Korean corpus by government and researchers, finding appropriate features for supervised learning approach and improving its prediction ...

Publication date: 2006